Airbnb Pricing Analysis Case Study

Created by: Malcomb Brown
   Updated: 2023-01-11

Import libraries

RUN SETUP SCRIPT FIRST!!!!

Sets up the project subfolder, establishes a connection with the database, loads SQL Magic, and imports the dataset.

Initialize ProjectSetup object

Save the url of the dataset

Table Schema

Column name                     Description
id                              Listing id
name                            Name of listing
host_id                         Host id
host_name                       Name of host
neighbourhood_group             Neighbourhood group the listing is in
neighbourhood                   Neighbourhood the listing is in
latitude                        Latitude coordinate of listing location
longitude                       Longitude coordinate of listing location
room_type                       Room type of the listing
price                           Price of the listing
minimum_nights                  Minimum number of nights stay for listing
number_of_reviews               Number of reviews for listing
last_review                     Date of the latest review
reviews_per_month               Number of reviews per month of listing
calculated_host_listings_count  Number of listings the host has
availability_365                The availability of the listing in the next 365 days
number_of_reviews_ltm           Number of reviews of listing in last 12 months
license                         If host is licensed

Extract CSV file

The 'name' and 'host_name' columns contain null values, and both are unnecessary because 'host_id' is available.
Remove the 'last_review', 'reviews_per_month', and 'license' columns.
Optimize the data types before uploading.
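
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's own code. The sample rows are hypothetical, dropping 'name' and 'host_name' is an assumption based on the note that they are unnecessary, and casting numeric fields to integers is a simple stand-in for data type optimization.

```python
import csv
import io

# Hypothetical three-row sample using a subset of the schema's column names.
SAMPLE_CSV = """id,name,host_id,host_name,price,minimum_nights,last_review,reviews_per_month,license
1,Cozy loft,10,Ann,150,2,2022-12-01,0.5,
2,,11,,89,30,,,
3,Sunny room,10,Ann,0,1,2022-11-15,1.2,
"""

# Columns named for removal in the notes above ('name' and 'host_name' are assumed).
DROP_COLUMNS = {"name", "host_name", "last_review", "reviews_per_month", "license"}
# Columns stored as integers instead of strings.
INT_COLUMNS = ("id", "host_id", "price", "minimum_nights")


def clean_rows(csv_text):
    """Drop unneeded columns and cast numeric strings to int."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        kept = {key: value for key, value in row.items() if key not in DROP_COLUMNS}
        for column in INT_COLUMNS:
            if column in kept:
                kept[column] = int(kept[column])
        cleaned.append(kept)
    return cleaned


rows = clean_rows(SAMPLE_CSV)
print(rows[0])
```

After this step, every remaining field in the sample is populated, which mirrors the goal of having no null values left before upload.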

Inspect the dataframe

Transform the data

Reinspect the data

No null values remain.

Top Categorical Variables

Frequency in parentheses.
Neighborhood Group: Manhattan (16847)
Neighborhood: Bedford-Stuyvesant (2779)
Room Type: Entire home/apt (22761)

There appear to be one or more records with a price of 0.

Upload the cleaned verified dataset to a MySQL database for analysis

Initial Assumptions

Questions

Find the median

There are price outliers on the high (right) side of the distribution.
The dataset has a positive, or right, skew.
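
A quick way to check for right skew is to compare the mean with the median: a few very high prices pull the mean up while the median barely moves. The sketch below uses hypothetical prices, not values from the dataset, and treats "mean greater than median" as a rule of thumb rather than a formal test.

```python
from statistics import mean, median

# Hypothetical prices with one very expensive listing.
prices = [60, 85, 100, 130, 150, 200, 1500]

average_price = mean(prices)
median_price = median(prices)

# Rule of thumb: a mean well above the median suggests a right (positive) skew.
is_right_skewed = average_price > median_price

print(f"mean={average_price:.2f}, median={median_price:.2f}, right skew: {is_right_skewed}")
```

The same pattern shows up in the feature summary for this dataset, where the average price (197.70) sits well above the median price (130.00).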

Visualize the distribution with histogram

What is the most common price point?

What percent of listings are priced below $200.00?

70.83% of all listings are priced between 0 and 199.00 dollars
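
A percentage like this is the count of listings under the threshold divided by the total count. A minimal sketch, assuming prices are held in a plain list; the function name `percent_below` and the sample prices are hypothetical.

```python
def percent_below(prices, threshold):
    """Return the percentage of prices strictly below the threshold."""
    if not prices:
        raise ValueError("prices must not be empty")
    below = sum(1 for price in prices if price < threshold)
    return 100 * below / len(prices)


# Hypothetical prices: three of the five are below 200.
sample_prices = [50, 120, 199, 200, 350]
share = percent_below(sample_prices, 200)
print(f"{share:.2f}% of listings are priced below $200.00")
```

The comparison is strict, so a listing priced at exactly 200 is not counted, which matches the "0 to 199.00" range stated above.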

Quick Feature Summary

           Listings: 39,851
              Hosts: 26,263
Neighborhood Groups: 5
      Neighborhoods: 244
         Room Types: 4
Average Price (USD): 197.70
Minimum Price (USD): 10.00
 Median Price (USD): 130.00
Maximum Price (USD): 16,500.00

The price of 150.00 is the most common price point, appearing 1164 times.
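
The most common price point is the mode of the price column. One way to find it, along with how often it appears, is `collections.Counter`; the prices below are hypothetical.

```python
from collections import Counter

# Hypothetical prices; 150 appears most often.
prices = [150, 100, 150, 200, 100, 150]

# most_common(1) returns a list with one (value, count) pair.
most_common_price, occurrences = Counter(prices).most_common(1)[0]

print(f"Most common price: {most_common_price} (appears {occurrences} times)")
```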

Location

Top 10 Neighborhoods by average price

Visualize the Top 10 Neighborhoods

Prospect Park and Fort Wadsworth have the highest average listing prices of any neighborhood. The values are skewed due to the low number of listings in each.

Prospect Park has 7 listings and Fort Wadsworth only has 1.

What are the bottom 10 Neighborhoods by average listing price?

Because some neighborhoods have only 1 listing, comparing average prices by neighborhood is highly susceptible to outliers.
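
One way to reduce this sensitivity is to report the listing count next to each average and, optionally, to rank only neighborhoods with a minimum number of listings. The sketch below is an illustration under assumptions: the neighborhood labels, prices, and the `min_listings` parameter are hypothetical and not part of the original analysis.

```python
from collections import defaultdict
from statistics import mean


def average_price_by_neighborhood(listings, min_listings=1):
    """Return (neighborhood, (average price, count)) pairs, highest average first."""
    prices = defaultdict(list)
    for neighborhood, price in listings:
        prices[neighborhood].append(price)
    summary = {
        name: (mean(values), len(values))
        for name, values in prices.items()
        if len(values) >= min_listings
    }
    return sorted(summary.items(), key=lambda item: item[1][0], reverse=True)


# Hypothetical listings: neighborhood "A" has a single, very expensive listing.
listings = [("A", 900), ("B", 100), ("B", 140), ("B", 120), ("C", 200), ("C", 220)]

ranked_all = average_price_by_neighborhood(listings)
ranked_filtered = average_price_by_neighborhood(listings, min_listings=2)
print(ranked_all)
print(ranked_filtered)
```

With no minimum, the single listing in "A" puts that neighborhood at the top of the ranking; requiring at least two listings removes it and leaves a ranking based on more than one data point.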

Average price by 'neighborhood_group'

As expected, Manhattan listings have the highest average price.
Manhattan also has the most listings.
Unexpectedly, Staten Island has by far the fewest listings and the lowest maximum price, yet it has the third-highest average price.

What percentage of listings does each neighborhood group have?

79.4% of all listings are located in either Manhattan or Brooklyn.

Which neighborhood groups have the most availabilities?

75.7% of all listings with any availability over the next year are located in either Manhattan or Brooklyn.
86.1% of all listings with no availabilities over the next year are in Manhattan and Brooklyn.

Visualize price by location

Room Type

Hotel rooms have the highest average prices.
Entire Home/Apts have the second highest average price, more than Private and Shared rooms combined.

Room Type Outliers

Hotel rooms have more consistent pricing, having the fewest outliers.
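
Outliers per room type can be counted in several ways. The sketch below assumes the common box-plot convention of flagging values more than 1.5 times the interquartile range beyond the quartiles; the original analysis may have used a different rule, and the prices shown are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [value for value in values if value < low or value > high]


# Hypothetical prices grouped by room type.
prices_by_room_type = {
    "Entire home/apt": [80, 90, 100, 110, 120, 130, 900],
    "Hotel room": [200, 210, 220, 230, 240],
}

outlier_counts = {
    room_type: len(iqr_outliers(prices))
    for room_type, prices in prices_by_room_type.items()
}
print(outlier_counts)
```

Note that `statistics.quantiles` uses its default "exclusive" method here, so quartile values can differ slightly from those produced by other tools.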

Entire home/apt and Private room account for 98.2% of all listings

Room type and Location

Availability by Room Type

Reviews

Price not significantly correlated to:

Takeaways

Overall

Location

Room Type

Reviews

Recommendations

1. Further Analyze Location

We should gather more data related to 'neighborhood_group' locations, such as crime data and proximity to cultural or sporting venues, to better understand what drives location's influence on price. The data required is public, so it can be collected easily and at minimal cost. The collection and analysis can be completed in two weeks.

2. Feature Selection

A machine learning model will be required to supply our clients with a rental price suggestion. To this end, feature selection and normalization will have to be planned and executed. The features will include neighborhood, neighborhood group, and room type. Selection and standardization can be accomplished within a week.

3. ML Model Development

Recommendation #2 is a prerequisite. Model selection, testing, and deployment will take approximately four weeks, not including the regular testing and refactoring needed as new data becomes available.